Genome Biology
Springer Science and Business Media LLC
Preprints posted in the last 30 days, ranked by how well they match Genome Biology's content profile, based on 555 papers previously published here. The average preprint has a 0.30% match score for this journal, so anything above that is already an above-average fit.
Krishnan, N. M.; Rahman, S. I.; Olsen, L. R.; Panda, B.
Many biological studies could benefit from combining data from legacy microarray and high-throughput sequencing platforms, especially in clinical domains where collecting additional samples is not possible. However, incompatibility between platforms makes legacy data difficult to integrate, owing to differences in platform design, target preparation, and dependence on prior annotations. Here, we describe X-Plat, a cross-platform data transformation tool for both expression and methylation assays that interconverts data between microarray and sequencing platforms using per-gene second-degree polynomial regression. X-Plat learns cross-platform conversion rules from paired microarray-sequencing datasets spanning multiple conditions, sample sources, organisms, and platforms, and evaluates performance using cross-validated root mean square error (RMSE) per gene. In rat, Arabidopsis, and human datasets, X-Plat achieved lower cross-validated RMSE than TDM, HARMONY, and HARMONY2 for the vast majority of genes (≥95% in all sequencing-to-array transformations and most array-to-sequencing transformations, with nearly 82% in the Arabidopsis array-to-sequencing setting), and these findings were confirmed using RMSE on held-out test samples from the first cross-validation fold. X-Plat also achieved low RMSE (≤0.2) for the majority of CpG regions in paired human array and sequencing methylation datasets. Using X-Plat, users can transform data between microarray and high-throughput sequencing platforms, enabling cross-platform comparison and reuse of legacy cohorts.
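The core operation described above, fitting a second-degree polynomial per gene and scoring it by cross-validated RMSE, can be sketched in a few lines of numpy. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names and the k-fold scheme are hypothetical.

```python
import numpy as np

def fit_gene_map(x_array, y_seq, deg=2):
    """Fit a per-gene polynomial mapping array intensities to sequencing values."""
    return np.polyfit(x_array, y_seq, deg)

def cv_rmse(x, y, k=5, deg=2):
    """Cross-validated RMSE for one gene's cross-platform mapping."""
    idx = np.arange(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # fit on all samples outside the fold
        coef = np.polyfit(x[train], y[train], deg)
        pred = np.polyval(coef, x[fold])         # predict the held-out fold
        errs.append((pred - y[fold]) ** 2)
    return float(np.sqrt(np.concatenate(errs).mean()))
```

In practice such a mapping would be fit independently for every gene shared between the paired platforms, keeping the per-gene RMSE as the quality signal.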
Tanner, R. M.; Perkins, T. J.
Histone modifications are a key component of the epigenetic state of a cell, and they vary widely across different cell and tissue types, conditions, and disease states. Indeed, the majority of the genome is enriched with one histone mark or another across the thousands of cellular conditions that have been studied to date. Here, we use the largest-to-date collection of histone modification ChIP-seq datasets to identify the most important sites of histone modifications genome-wide. Collected and uniformly reprocessed by the International Human Epigenome Consortium, these data include 5339 datasets enriched at nearly one billion total peaks across 59 different major cell or tissue types, in healthy and disease conditions, for six different histone marks. We propose FindMetapeaks, a new approach to identifying histone mark metapeaks, which are genomic regions with enrichment of a mark across many samples. We show that many of these epigenetic metapeaks are strongly indicative of cell and tissue type, or are associated with other sample characteristics, and highlight key regulatory regions of the genome. However, we also show that many metapeaks contain redundant information, and that parsimonious subsets of metapeaks can be selected by machine learning to predict cell state. Our histone mark metapeak atlas provides a concise set of regions for interpreting the epigenome. Availability: https://github.com/rmbioinfo83/FindMetapeaks/
Siguret, C.; Olivier, M.; Huneau, C.; SOW, M. D.; Stenger, P.-L.; Klopp, C.; Martin, M.-L.; Tamby, J.-P.; Civan, P.; Pont, C.; Mathieu, O.; SALSE, J.
AGR (Ancestral Genome Reconstruction) is an automatic, publicly available, open-source pipeline for inferring paleogenomes from comparisons of modern species' genomes. It exploits hierarchical clustering of inter-species chromosomal synteny relationships, and can be used to unveil how ancestral genomes, genes, sequences, and functions have been shaped over millions of years of evolution leading to present-day plants.
Schroeder, L.; Gerber, S.; Ruffini, N.
Background: Ambient RNA contamination is a pervasive artifact of single-cell and single-nucleus RNA sequencing (sxRNA-seq), yet no consensus exists on which computational removal tool performs best across experimental platforms. Results: We present a systematic benchmark of six tools (CellBender, DecontX, SoupX, scCDC, scAR, and CellClear) evaluated across six human-mouse cell line mixing (hgmm) datasets (1k-20k cells) providing partial ground truth, two droplet-based complex tissue datasets (PBMC scRNA-seq; prefrontal cortex snRNA-seq), and a well-plate-based dataset (BD Rhapsody WBC). Using inter-species counts as partial ground truth, we quantify sensitivity, specificity, precision, and removal consistency per tool. We further apply a count-integrity criterion quantifying gene-cell positions where corrected values exceed raw counts. This reveals that scAR and CellClear do not merely denoise but fundamentally restructure count matrices: CellClear replaces >93% of counts with values derived from matrix factorization, while scAR generates spurious cell types absent from uncorrected data, including three spurious coarse cell types in the BD Rhapsody dataset and up to eight novel cell types in the prefrontal cortex. CellBender and SoupX exhibit reliable contamination removal with minimal count distortion. DecontX and scCDC are the only tools operable on non-droplet platforms without raw count matrix access. Runtime benchmarking at atlas scale (up to 172,000 nuclei) further demonstrates that CellClear fails to scale. Conclusions: Count matrix integrity, not removal sensitivity alone, must be a primary criterion when selecting ambient RNA correction tools. We provide platform-specific recommendations and a decision framework to guide tool selection across experimental contexts.
KP, M. M.
Variant Call Format (VCF) files are the dominant interchange format for genomic variant data, but their size - routinely exceeding tens of gigabytes for population-scale studies - creates a significant computational bottleneck at the quality-filtering stage. Existing tools such as bcftools and vcftools provide broad functionality through general-purpose expression engines, but incur substantial per-record overhead from dynamic field lookup, type resolution, and heap allocation. We present vcfilt, a streaming, batch-parallel VCF filter implemented in Go that restricts its scope to three high-frequency filter criteria (INFO/DP, INFO/AF, and QUAL) and applies them via a zero-allocation byte-scan parser. Benchmarked on real 1000 Genomes Project data (chromosome 20, 1,811,146 variants), vcfilt achieves 147,000 variants/second on an 18 GB plain-text VCF file using a single thread - a 12.2x speedup over bcftools 1.18 under identical conditions. On gzip-compressed input, the speedup is 7.9x. Output is byte-for-byte identical to bcftools across all tested filter combinations. vcfilt is distributed as a self-contained static binary, a Docker image, and a Singularity-compatible container. The source code and all benchmark scripts are openly available under the MIT licence.
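The three filter criteria the abstract names (INFO/DP, INFO/AF, and QUAL) can be illustrated with a minimal per-record check in Python. This sketch shows only the filtering logic, not the zero-allocation byte-scan design vcfilt describes; the function name and defaults are assumptions, and missing-value handling (e.g. QUAL of ".") is omitted.

```python
def passes(line, min_dp=10, min_af=0.01, min_qual=30.0):
    """Return True if a VCF line satisfies DP/AF/QUAL thresholds (sketch only)."""
    if line.startswith("#"):
        return True  # header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    qual = float(fields[5])                       # column 6: QUAL
    # column 8: INFO, semicolon-separated key=value pairs (flags skipped)
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    dp = int(info.get("DP", 0))
    af = float(info.get("AF", "0").split(",")[0])  # first ALT allele only
    return qual >= min_qual and dp >= min_dp and af >= min_af
```

A streaming filter would apply this predicate line by line; the speedups reported above come from avoiding exactly this kind of per-record string splitting and dictionary construction.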
Sharif, B.; Kutschera, V. E.; Oskolkov, N.; Guinet, B.; Lord, E.; Chacon-Duque, J. C.; Oppenheimer, J.; van der Valk, T.; Diez-del-Molino, D.; D. Heintzman, P.; Dalen, L.
Ancient DNA (aDNA) research has advanced rapidly with the development of high-throughput sequencing, which now enables genome-wide analyses of large collections of prehistoric specimens. However, analysing palaeontological and archaeological material with highly degraded DNA constitutes a major bioinformatic challenge. DNA from such samples is characterised by short fragment lengths, low endogenous content, post-mortem damage, and considerable cross-species contamination, which can increase spurious mapping and reference bias, affecting downstream population genetic inferences. Here we present DNAharvester, a modular and reproducible pipeline designed specifically for the processing of highly degraded DNA from ancient and historical specimens. DNAharvester integrates metagenomic filtering before mapping, competitive mapping, adaptive aligner selection (incorporating algorithms such as BWA-aln, BWA-mem, and Bowtie2), and systematic evaluation of reference bias and spurious mapping. By incorporating flexible mapping and filtering strategies, the pipeline can be adapted to varying sample preservation, with a distinct focus on maximising authentic data recovery from highly degraded material. Furthermore, DNAharvester features comprehensive subworkflows for iterative assembly of mitogenomes, identification of genomic repeats and CpG sites, taxonomic classification, microbial/pathogen screening of unmapped reads, genetic sex determination, and variant calling for downstream analyses. To accommodate datasets with varying sequencing depths, the pipeline incorporates multiple variant calling strategies, including diploid variant calling, genotype likelihood estimation, and pseudo-haploid random allele calling. Implemented in Nextflow, DNAharvester provides a highly scalable, containerised framework that enhances reproducibility, portability, and robustness in aDNA analyses. 
We validated the pipeline across a gradient of simulated scenarios and empirical datasets, demonstrating its ability to systematically mitigate complex background contamination while preserving authentic genomic signals even in the most challenging of circumstances. By streamlining complex bioinformatic tasks through simple configuration files, DNAharvester establishes a standardised approach for the rigorous analysis of highly degraded DNA datasets and makes genomic analyses of ancient remains accessible to the broader research community.
Melnykov, A. V.
The genetic code is conserved across all domains of life and is often described as universal. Nevertheless, many exceptions to the "universal" code have now been documented, most of them through manual or semi-automated inspection of highly conserved genes. Modern bioinformatics tools have improved our ability to find alternative genetic codes but remain computationally expensive, preventing widespread use on the thousands of new species identified by sequencing environmental samples. Here I report a >100-fold accelerated method for inferring the genetic code directly from assembled genomes and apply it to thousands of previously uncharacterized assemblies from archaea and bacteria. I describe new candidate genetic code variations in both domains, including the first archaeal sense-codon reassignment. Identifying genetic code variations is important for understanding the evolution of the standard code and for improving the accuracy of protein databases and open reading frame identification.
Lin, Y.; Zhu, X.; Zhou, X.; Zhang, X.; Cai, G.; Zhao, W.; Zhou, J.; Liu, J.; Zhu, Q.; Zhang, M.; Zhou, B.; Gu, X.; Zhou, Z.
Quantifying cross-species relationships among cell types from single-cell transcriptomic data can reveal both conserved and divergent patterns of cell-type hierarchies. However, existing cross-species integration methods are often limited in modeling genes beyond orthologs, in leveraging cell-type-resolved transcriptional context, or in learning explicit type-level representations. Here we present CHORD, a cross-species integration framework that jointly learns representations of genes, cells, and cell types. We demonstrate that CHORD can integrate cross-species single-cell atlases and support cell-type annotation with unknown cell-type detection. In the frog-zebrafish embryogenesis and mammalian motor cortex atlases, CHORD infers cell-type trees that place conserved cell types from different species in relative proximity and summarize hierarchical relationships among cell types. CHORD also supports cross-species comparison of continuous phenotypic variation by placing embryonic cells along an aligned developmental timeline. CHORD further yields gene embeddings that capture orthologous and functional relationships, and gene importance scores linking genes to cell types.
Daoud, A.; Roy, S.; Zeng, H.; Bao, X.; Zhang, Z.; Wang, J.; Parodi, P.; Reddy, A.; Liu, J.; Ben-Hur, A.
Large-scale sequence-to-function deep learning models have demonstrated unparalleled ability to model biological sequences and have revolutionized the field of regulatory genomics. However, the majority of such efforts have centered on human and mammalian systems, leaving plant regulatory genomics comparatively underexplored. To address this gap, we introduce Deep-Plant, a supervised foundation model trained to predict chromatin state directly from genomic sequence. In contrast to large language models, which are trained in a self-supervised manner using sequence alone, our model is trained to predict chromatin state across tissues and conditions. Training the model on a large collection of genome-wide experiments, including DNA accessibility, transcription factor binding, and histone modifications, provides it with added biological context beyond the sequence itself. We demonstrate that the resulting model is an effective platform for developing accurate models of regulatory activity relevant to gene expression and active enhancers, exhibiting large improvements in speed, accuracy, and interpretability over the complementary approach of fine-tuning DNA language models. Deep-Plant models are available in Arabidopsis and rice, and work well as a building block for sequence modeling in related species such as corn. Together, these results establish supervised, chromatin-informed foundation models as a practical and effective paradigm for regulatory sequence modeling in plants.
Gupta, S.; Kumar, A.; Bhati, U.; Shankar, R.
The genome of a species is its book of life, but opening that book remains costly owing to the limitations of existing sequencing technologies. Short-read sequencers have high fidelity but struggle to capture long and complex genomes. To counter this, long reads from third-generation sequencers are used, which are prone to indel errors. Reads from both approaches are therefore used together at very high coverage, making sequencing projects unreasonably expensive and inaccessible to most. Here we present a first-of-its-kind generative deep-learning system, GANGE, which not only recovers the correct sequence with high accuracy from indel-prone ONT reads at manyfold lower coverage but also extends it by 4 kb, achieving sequencing without sequencing, horizontally as well as vertically, while consistently maintaining >92% accuracy. This makes it possible to drastically reduce sequencing project costs. GANGE was tested on the A. thaliana and O. sativa genomes and human chromosome 1, where it delivered outstanding assembly performance. It was also used to accurately generate the 2 kb upstream promoters of all genes from 12 different species, demonstrating that regulomics research can now be undertaken using RNA data alone when a genome sequence is unavailable. With all this, GANGE brings a democratic turning point to genomics and sequencing research.
Zhu, Y.; Zhang, C.; Calhoun, V. D.; Bi, Y.
Background: Transcriptomic gene signatures are widely used to infer pathway activity and biological mechanism from bulk cancer expression data, yet current evaluation strategies primarily emphasize internal coherence, predictive performance, or scoring robustness. A quantitative framework for assessing how much signature variation remains independent of background expression structure has been lacking. Results: Unlike existing single-number signature-quality metrics such as Berglund uniqueness, residual-ratio auditing reports a trajectory across null-model richness: for each signature we compute the residual ratio [Formula] at progressively enriched expression-PC subspaces, together with an inverse-participation-ratio (IPR) concentration diagnostic that reports the effective number of axes absorbing each signature. Applied to a curated 17-entry benchmark, all 50 MSigDB Hallmark gene sets, and 1,181 Reactome pathways across 8 TCGA cancer types (4,462 samples), with external validation in METABRIC, the framework produces two complementary readouts. First, the curated panel is absorbed into the ExprPC50 subspace at residual ratios 18-43% below size-matched random 30-gene baselines in every cancer (curated mean r⊥ range 0.109-0.177 vs. random mean 0.182-0.288), providing the framework's central quantitative discrimination between biologically coherent signatures and arbitrary gene combinations.
Second, within the curated panel the ExprPC50 residual ratio is negatively correlated with the top-5 absorption concentration in every cancer (Spearman ρ from -0.59 in PRAD to -0.89 in SKCM, median -0.71; all 8 significant at p < 0.05, most at p < 10^-3); we report this correlation as a descriptive geometric property of the null-model coordinate system rather than as a biological law, because 1,000 random 30-gene draws projected through the same top-50 expression-PC basis reproduce the same pan-cancer median ρ (-0.73; Supplementary Table S16), and it is robust to compositional nuisance: after rebuilding the null basis as immune-PC1 ⊕ stromal-PC1 ⊕ proliferation-PC1 plus 47 residual PCs, the per-cancer ρ becomes more negative rather than shallower (median -0.86; Supplementary Table S17), ruling out tumor purity, immune infiltrate, and stromal fraction as drivers of the pattern. Because absorption at ExprPC50 is a geometric property of how any signature direction sits in expression-PC space, tier-level distributional structure at this operating point is not separable beyond the low-vs-upper band split: a Kruskal-Wallis omnibus is significant (p = 4.9 × 10^-13), but pairwise Dunn's post-hoc tests show that Tiers 1, 4, and 5 are not separable (p_BH > 0.2). The trajectory shape itself is empirically bootstrap-invariant: across 200 sample-level fixed-basis bootstrap resamples of the 17 curated entries in BRCA, the mean pairwise Pearson correlation of trajectory-shape vectors is 0.999, and individual cell-level 95% bootstrap CI half-widths at B = 1,000 resamples are in the range 0.002-0.053. External replication in the METABRIC breast cancer cohort (n = 1,980 samples, microarray) showed moderate-to-strong rank-ordering concordance with TCGA-BRCA across the 17 curated entries (Spearman ρ = 0.72 on the 17-signature ordering, 95% Fisher-z CI 0.37-0.89, p = 0.001).
Under an upper-bound sensitivity analysis, 45 of 50 Hallmark gene sets and 992 of 1,181 Reactome pathways had ExprPC200 residual ratios below the mean of their size-matched random baselines, a descriptive statistic reflecting axis alignment under rich null models, not a failure rate. In causal DAG simulations (n = 100 replicates), a signature driven entirely by a latent confounder retained r⊥ = 0.233 at ExprPC50, numerically comparable to Tier 1 validated drivers, so a single-point residual ratio cannot adjudicate confounder-independence. The framework's load-bearing signals are therefore the trajectory shape (statistically invariant under sample-level resampling) and the magnitude gap between the curated panel and its random 30-gene baseline (the curated-vs-random discrimination), read jointly, not the value of r⊥ at any single null-model dimensionality.
Table S16: Random-gene-set null for the ExprPC50 r⊥-vs-c(5) correlation, and curated-panel absolute gap vs. random baselines. For each of the 8 primary-analysis TCGA cancer cohorts, we drew B = 1,000 random 30-gene sets from the gene universe of the preprocessed expression matrix and computed, for each draw, the residual ratio r⊥(k = 50) and the top-5 absorption concentration c(5) under the same top-50 sample-space PC basis used for the curated benchmark (Methods, §Statistical analysis; reference implementation accompanies the project repository as script 35).
Column "Empirical curated ρ" repeats the 17-signature Spearman ρ between r⊥ and c(5) reported in the main text and in Supplementary Table S10; column "Null-A ρ (random 30-gene)" gives the Spearman ρ across the 1,000 random-draw (r⊥, c(5)) pairs per cancer; column "Δ (emp − null-A)" reports the difference. For reference, column "Null-B ρ" gives the corresponding Spearman ρ across 1,000 iid Gaussian unit vectors N(0, I_N) in sample space (a uniform random direction that does not inherit the expression covariance geometry). The rightmost columns compare the curated 17-entry panel's mean r⊥ at ExprPC50 to the random 30-gene baseline mean, both in absolute units and as a percentage gap; this magnitude gap is the quantitative discrimination between curated biological signatures and arbitrary gene combinations on which the framework's central claim rests.
Table S17: Purity-aware null for the curated-panel ExprPC50 r⊥-vs-c(5) correlation. For each cancer we rebuild a rank-50 sample-space basis as follows. Columns 1-3 are the PC1 directions of: an immune-infiltrate proxy panel (CD3D, CD3E, CD4, CD8A, CD8B, CD19, CD68, PTPRC, FOXP3, IFNG); a stromal/fibroblast proxy panel (COL1A1, COL1A2, COL3A1, VIM, FN1, ACTA2, PDGFRA, PDGFRB); and the 50 proliferation markers already used in the null-model hierarchy (§Null model hierarchy). Columns 4-50 are the top-47 PCs of the residual expression matrix after projecting out the QR-orthonormalized 3-column biological block; the full 50-column basis is re-orthonormalized by a final QR pass.
The 17 curated benchmark entries are then re-scored under this purity-aware basis, and the per-cancer Spearman ρ between r⊥ and c(5) is recomputed. Column "Standard ρ" is the value reported in the main text (same as Supplementary Table S16, column "Empirical curated"); column "Purity-adjusted ρ" is the value under the new basis; column "Δ" is the difference. This test was pre-specified as a check on whether the observed ρ(r⊥, c(5)) could be driven by tumor-purity, immune, or stromal composition artifacts that a standard top-50 PC basis would implicitly absorb.
Conclusions: Residual-ratio auditing provides an interpretable and practical framework for quantifying how much of a transcriptomic gene signature's variance remains orthogonal to a chosen background-expression model. The two statistically reliable quantities it reports are (i) the shape of the trajectory r⊥(k) across null-model richness, which is bootstrap-invariant across sample-level resamples, and (ii) the magnitude gap between the curated panel's residual ratio and size-matched random 30-gene baselines at a fixed operating point, which is 18-43% in all 8 TCGA cancers and survives a purity-aware null-model construction. The negative correlation between r⊥ and the top-5 absorption concentration c(5) (curated-panel median ρ = -0.71) is reproduced by random 30-gene sets under the same basis (random-draw median ρ = -0.73) and is therefore best read as a descriptive geometric property of the null-model coordinate system rather than a biological discovery about curated signatures. Any single operating-point residual ratio carries materially wider cell-level uncertainty than the trajectory shape and cannot, on its own, adjudicate confounder-independence.
The framework's outputs describe a signature's geometric relationship to modeled background expression structure and do not evaluate clinical utility: a signature with a low residual ratio may still be clinically valuable when that low value reflects alignment with a strong prognostic or actionable program such as proliferation, immune infiltration, or cell cycle, and the framework is not a substitute for calibrated prognostic or predictive classifiers. All findings are based on bulk RNA-seq (TCGA PanCancer Atlas, 8 cancer types) and microarray (METABRIC) data; transfer to single-cell, single-nucleus, or spatial transcriptomics is out of scope and not claimed. Used within this scope, reading the trajectory shape and the magnitude-gap signal jointly rather than the value of r⊥ at any one k, the framework adds a complementary audit layer to existing pathway-scoring and experimental-validation workflows, and supports more calibrated interpretation, comparison, and reporting of transcriptomic gene signatures in cancer studies.
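The central quantity of this abstract, a signature's residual ratio after projection onto the top-k expression PCs, can be sketched with a plain SVD. This is one illustrative reading of the stated definition, not the authors' reference implementation; in particular, scoring a signature as the mean of its member genes' centered profiles is an assumption.

```python
import numpy as np

def residual_ratio(expr, sig_genes, k=50):
    """Fraction of a signature direction's norm left after projecting onto
    the top-k sample-space principal components of the expression matrix.

    expr      : genes x samples array
    sig_genes : row indices of the signature's member genes
    """
    X = expr - expr.mean(axis=1, keepdims=True)   # center each gene across samples
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Q = Vt[:k].T                                  # samples x k orthonormal PC basis
    s = X[sig_genes].mean(axis=0)                 # signature direction in sample space
    s = s / np.linalg.norm(s)                     # unit-normalize before projecting
    resid = s - Q @ (Q.T @ s)                     # component outside the PC subspace
    return float(np.linalg.norm(resid))
```

Sweeping k then yields the trajectory across null-model richness that the abstract treats as the load-bearing readout, rather than any single operating point.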
Temirbek, A.; Lekschas, F.; Sankaran, K.; Colubri, A.
Interactively exploring gene expression gradients across low-dimensional cell embeddings is central to single-cell RNA sequencing analysis, yet no existing tools allow users to sketch trajectories and interactively compute pathway-level interpretations. We present scSketch, a tool that enables users to iteratively explore and test trajectory hypotheses in single-cell data while maintaining statistical validity and biological interpretability. Specifically, users apply interactive directional sketching to draw trajectories across embeddings and probe continuous processes such as cellular differentiation and cell state transitions. scSketch automatically computes gene-trajectory correlations and applies online false discovery rate (FDR) control to maintain statistical validity during iterative exploration. Significant genes are grouped into Reactome pathways for contextual interpretation. Applied to human oral keratinocytes infected with human cytomegalovirus, scSketch revealed infection-associated gradients involving interferon responses, metabolic remodeling, and autophagy. Together, these features position scSketch as a bridge between exploratory visualization and mechanistic insight in single-cell biology. Pseudocode and full algorithm details for online FDR and interactive directional sketching are available in Supplementary Methods S1 and S2.
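Online FDR control during iterative exploration can be illustrated with a toy alpha-investing scheme in the spirit of Foster and Stine: wealth is spent on each test and replenished on each rejection. This is a generic sketch, not necessarily the rule scSketch implements, and the spending schedule below is an arbitrary assumption.

```python
def alpha_investing(pvals, w0=0.05, payout=0.05):
    """Toy online alpha-investing: process p-values in arrival order,
    spending from a wealth budget and earning a payout on each rejection."""
    wealth = w0
    rejections = []
    for j, p in enumerate(pvals):
        if wealth <= 0:
            rejections.append(False)  # budget exhausted; no more rejections
            continue
        alpha_j = wealth / (j + 2)    # simple spending rule (assumption)
        reject = p <= alpha_j
        # earn on rejection, pay alpha_j/(1-alpha_j) otherwise
        wealth += payout if reject else -alpha_j / (1 - alpha_j)
        rejections.append(reject)
    return rejections
```

The key property for interactive use is that decisions are made as each hypothesis arrives, so users can keep sketching new trajectories without invalidating earlier calls.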
Strassburg, C.; Pitlor, D.; Singhi, A. D.; Gottschalk, R.; Uttam, S.
Summary: Mitochondrial transcript abundance is a standard quality-control metric in single-cell RNA sequencing, but fixed percentage thresholds fail to account for the substantial variation in mitochondrial content across cell types and tissues, risking both retention of compromised cells and exclusion of transcriptionally active, viable cell populations. We present MitoChontrol, a cell-type-aware probabilistic framework for mitochondrial quality control that models the mitochondrial transcript fraction within transcriptionally coherent clusters as a Gaussian mixture distribution. Compromised-cell components are identified from the upper tail of each cluster-specific distribution, and filtering thresholds are defined as the point at which the posterior probability of cellular compromise exceeds a user-defined confidence value. Applied to controlled perturbation experiments and a pancreatic ductal adenocarcinoma single-cell dataset, MitoChontrol selectively removes transcriptionally compromised cells while preserving biologically elevated but viable populations, outperforming fixed-threshold and outlier-based approaches. Availability and Implementation: MitoChontrol is implemented in Python and integrates directly with AnnData-based workflows. It is freely available under the GNU General Public License v3 (GPL-3.0) at https://github.com/uttamLab/MitoChontrol (DOI: https://doi.org/10.5281/zenodo.19423054)
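The posterior-probability thresholding step can be sketched for a single cluster, assuming a two-component Gaussian mixture whose parameters have already been fit (a "healthy" and a "compromised" component over the mitochondrial fraction). All parameter values, names, and the grid scan are hypothetical; the fitting procedure itself is not reproduced here.

```python
import math

def posterior_compromised(x, mu_ok, sd_ok, mu_bad, sd_bad, pi_bad=0.1):
    """Posterior probability that a cell with mito fraction x belongs to the
    upper (compromised) component of a two-component Gaussian mixture."""
    def pdf(v, mu, sd):
        return math.exp(-0.5 * ((v - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    num = pi_bad * pdf(x, mu_bad, sd_bad)
    den = num + (1 - pi_bad) * pdf(x, mu_ok, sd_ok)
    return num / den

def cluster_threshold(mu_ok, sd_ok, mu_bad, sd_bad, pi_bad=0.1, conf=0.9):
    """Smallest mito fraction (on a coarse grid) at which the posterior of
    compromise exceeds the user-defined confidence value."""
    for t in (i / 1000 for i in range(1001)):
        if posterior_compromised(t, mu_ok, sd_ok, mu_bad, sd_bad, pi_bad) >= conf:
            return t
    return 1.0
```

Because the threshold is derived per cluster from that cluster's own mixture, cell types with biologically elevated mitochondrial content get correspondingly higher cutoffs instead of a single fixed percentage.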
Xi, G.
Single-cell multi-omic technologies measure complementary aspects of cellular identity and regulatory state, yet most integration models compress these signals into one entangled latent space. Such representations are useful for clustering but poorly suited for mechanistic interpretation or perturbation-oriented analysis. We present scDisent (https://github.com/xiguoren/scDisent), a generative framework for disentangled representation learning that separates expression-associated variables (z_expr) from regulation-associated variables (z_reg) and links them through a sparse directed mapping. scDisent combines modality-specific encoding, variational disentanglement with total-correlation and orthogonality constraints, and a Gumbel-gated causal module protected by detach-based gradient isolation. Evaluated on benchmark datasets with matched modalities, scDisent achieved best-in-benchmark integration performance while exposing regulatory structure that competing integration methods do not model explicitly. The learned causal atlas remained sparse, perturbation analyses recovered biologically coherent lineage-associated programs, and cross-dataset discovery analyses highlighted interpretable immune, neural, and developmental signatures. Quantitative branch-separation analyses further showed that benchmark-label information concentrated in z_expr rather than z_reg. Together, these results position scDisent as a computational method that improves not only integration quality but also biological interpretability, making single-cell multi-omic representations better suited to biological question answering and in silico hypothesis generation.
Mauge, Y.; Ventre, E.
A key challenge in inferring gene regulatory networks (GRNs) governing cellular processes such as differentiation and reprogramming from experimental data lies in the impossibility of directly measuring protein dynamics at the single-cell level, which prevents establishing causal relationships between regulator activity and target responses. In earlier work, we introduced CARDAMOM, an algorithm that uses temporal snapshots of scRNA-seq data to calibrate a GRN-driven mechanistic model of gene expression. However, this method had several limitations: it could only rely on the relative ordering of time points rather than their exact labels, imposed restrictive quasi-stationary assumptions on protein dynamics, and depended on multiple hyperparameters. Here, we present CardamomOT, a new method based on the same mechanistic model that jointly reconstructs the GRN and unobserved protein trajectories from the data within a mechanistic optimal transport framework. By incorporating exact time labels and priors on protein kinetic rates from the literature, and substantially reducing the number of required hyperparameters, our approach addresses these limitations and substantially improves the accuracy and robustness of GRN calibration. We validate our framework on both in silico and experimental datasets, demonstrating computational scalability and consistently improved performance over state-of-the-art methods in both GRN and trajectory reconstruction. In particular, CardamomOT accurately recovers velocity fields driving cellular trajectories and unobserved protein levels, alongside reliable GRN structures. We also show that these improvements make the calibrated mechanistic model suitable to be used as a generative model to predict cellular responses to unseen perturbations. 
To our knowledge, this is among the first methods to explicitly integrate mechanistic GRN inference, trajectory reconstruction, and simulation of realistic datasets into a unified framework for scRNA-seq time series analysis.
Rodriguez-Martin, F.; Masero-Leon, M.; Gomez-Cabello, D.
Transcriptome-wide profiling of repetitive element expression reveals transposable element-derived transcripts that are deregulated in diverse biological contexts, including cancer. However, most RNA-seq pipelines are optimized for annotated genes and substantially undercount repeat RNA molecules, limiting their discovery and characterization. Here we present PERREO, a comprehensive, user-friendly pipeline for analyzing repetitive RNA elements from short- and long-read sequencing data. PERREO performs quality control, repeat-aware alignment and quantification, differential expression analysis, co-expression network analysis, and de novo transcript assembly with minimal computational expertise required. We validate PERREO across cell lines, tumor tissues, and liquid biopsies, demonstrating superior sensitivity to repetitive RNA signatures compared with standard RNA-seq approaches. PERREO integrates predictive modelling to identify biological associations and generates publication-ready visualizations. By removing the bioinformatic barrier to repetitive RNA discovery, this pipeline enables broader investigation of the repeatome's role in cellular biology and disease, yielding valuable results that, for specific analytical objectives, outperform certain existing tools and pipelines.
Wang, Y.; Liu, G.; Wang, Y.; Zhang, Y.
Accurately identifying transcription factor (TF) binding sites across species is essential for understanding conserved gene regulatory mechanisms. While experimental techniques such as ChIP-seq have enabled genome-wide TF-binding maps, their application is often constrained by the limited availability of high-quality antibodies. Computational approaches that leverage data from one species to predict TF-binding sites in other species have emerged as valuable alternatives. However, existing models often rely on uniform modeling assumptions, overlooking substantial variability in cross-species predictability across TFs. In this study, we systematically evaluated the cross-species predictability of 137 TFs using 425 human-mouse ChIP-seq dataset pairs matched by cell type, and identified key biological features underlying this variability. Building on these insights, we developed ChromTransfer, a TF-aware cross-species prediction framework that integrates DNA sequence, functional conservation, TF-specific co-binding signals, and shared chromatin context signals. These regulatory signals substantially improve prediction performance, particularly for TFs with weak or absent motif enrichment. Together, this study establishes a biologically informed and scalable framework for TF-specific, cross-species binding site prediction and provides a practical strategy for extending regulatory annotations across species.
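The abstract says ChromTransfer integrates sequence, conservation, co-binding, and chromatin-context signals, but does not specify the model. As a deliberately simplified stand-in (plain logistic regression over concatenated feature blocks; nothing here reflects ChromTransfer's actual architecture, and all variable names and feature shapes are invented for illustration), one way to combine such signals into a binding-site classifier is:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Plain-numpy logistic regression via gradient descent. Illustrative only."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted binding probability
        w -= lr * X.T @ (p - y) / len(y)     # mean log-loss gradient step
    return w

# Toy feature blocks for 200 candidate sites (all hypothetical):
rng = np.random.default_rng(1)
seq_score = rng.normal(size=(200, 1))       # motif match in the target species
conservation = rng.normal(size=(200, 1))    # functional-conservation signal
cobinding = rng.normal(size=(200, 3))       # TF-specific co-binding partners
X = np.hstack([seq_score, conservation, cobinding])
coef = np.array([1.5, 1.0, 0.5, 0.5, 0.5])  # synthetic ground-truth weights
y = (X @ coef + rng.normal(scale=0.5, size=200) > 0).astype(float)

w = fit_logistic(X, y)
acc = ((1.0 / (1.0 + np.exp(-(X @ w))) > 0.5) == y).mean()
```

The point of the sketch is the feature-block concatenation: each block can be inspected or ablated independently, which is how one would probe which signal class drives predictions for a given TF.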
Walter-Angelo, T.; Uzun, Y.
Single-cell multiome assays enable direct measurement of chromatin accessibility and gene expression within the same cell. Still, most experimental designs remain constrained to two (and, less commonly, three) modalities per cell. This limitation motivates computational models that can predict unmeasured layers and, simultaneously, help dissect how cis-regulatory accessibility relates to transcription at gene resolution. Existing cross-modal methods often prioritize latent alignment or modality reconstruction, making it difficult to isolate the impact of model inductive bias under a shared cis-regulatory feature definition. We present SPEAR, a configuration-driven framework for gene-centric regression of single-cell gene expression from chromatin accessibility using a fixed transcription start site (TSS)-centered representation shared across model families. Here we show that, under identical features, splits, and evaluation, model performance stratifies reproducibly across two multiome systems (mouse embryonic development and human hemogenic endothelium), with transformer encoders achieving the strongest mean test correlations (0.546 and 0.470, respectively). Per-gene performance distributions reveal substantial heterogeneity in predictability, indicating that accessibility-driven signal is concentrated in a subset of genes across contexts. Shapley value-based feature attribution further localizes predictive signal to promoter-proximal bins, with feature importance decaying with distance from the TSS, supporting a promoter-centered regime of cis-regulatory control within the modeled window. Together, these results provide a controlled comparison of inductive biases for chromatin-to-expression prediction and deliver analysis-ready outputs for gene-level interpretation. SPEAR is open source and publicly available at https://github.com/UzunLab/SPEAR. Supplementary data are available.
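SPEAR's shared representation is described as fixed, TSS-centered, and binned, but the abstract does not give the exact layout. The following hypothetical sketch shows how such a per-gene feature vector might be built from accessibility cut sites; the window size, bin size, and function name are assumptions, not SPEAR's actual code:

```python
import numpy as np

def tss_binned_features(cut_sites, tss, strand=1, window=10_000, bin_size=500):
    """Bin accessibility cut sites into a fixed TSS-centered vector.

    cut_sites: genomic positions of accessibility signal for one cell/gene.
    strand: +1 or -1, so upstream/downstream are oriented consistently.
    Returns one row of a shared feature representation (hypothetical layout).
    """
    rel = (np.asarray(cut_sites) - tss) * strand          # TSS-relative, stranded
    edges = np.arange(-window, window + bin_size, bin_size)
    counts, _ = np.histogram(rel, bins=edges)             # one count per bin
    return counts

# Three sites near a TSS at 1500 and one far outside the window:
feats = tss_binned_features([1000, 1200, 9000, 25_000], tss=1500)
```

Because every gene maps to the same fixed-length vector, any regressor (linear, tree, transformer) can be swapped in over identical features, which is exactly the controlled-comparison setting the abstract describes.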
Afanasiev, E.; Goeva, A.
Recent work by Goeva et al. introduced HiDDEN, a method for refining batch-level labels to infer cell-level perturbation without prior knowledge of affected populations, addressing the mismatch between sample-level labels and heterogeneous perturbation effects across cells. Here, we present found, a Python and R implementation of HiDDEN, supporting pipeline customization, by-factor grouping, hyperparameter selection, and visualization. Through benchmarking across diverse datasets, we show that performance depends strongly on modeling choices, particularly regression, grouping, and embedding dimensionality. found provides a practical, flexible, and accessible framework for robust cell-level perturbation analysis.
Halter, C.; Andreatta, M.; Carmona, S.
Early transcriptomic studies demonstrated that unsupervised analysis of bulk gene expression can reveal clinically meaningful patient subgroups. Single-cell RNA sequencing (scRNA-seq) provides high-resolution characterization of cellular heterogeneity and therefore enables more refined patient stratification. Several computational approaches have been proposed to summarize single-cell data into sample-level representations for cohort-level exploratory analyses. However, these methods generally do not explicitly account for the compositional nature of cell-type proportions. Based on eleven scRNA-seq cohorts across different biological conditions, we evaluated several state-of-the-art sample representation methods for their ability to recover known biological groupings in an unsupervised setting. Surprisingly, we found that baseline approaches based on cell-type composition and pseudobulk gene expression consistently matched or outperformed more complex methods while requiring orders of magnitude fewer computational resources. In particular, centered log-ratio-transformed cell-type proportions achieved the highest stratification performance and demonstrated robustness to batch effects. The stratification signal was frequently concentrated in a small subset of highly variable cell types, and performance was robust across diverse cell type annotation strategies. Altogether, these results suggest that clinically relevant inter-sample variation in scRNA-seq cohorts is largely driven by differences in cell-type composition. Importantly, compositional representations directly link cohort-level structure to specific cell populations, enabling mechanistic interpretation and facilitating clinical translation. 
We provide scECODA, an open-source R package for scalable and interpretable cohort-level Exploratory COmpositional Data Analysis of scRNA-seq data, and establish cell-type compositional representations as a powerful and interpretable baseline for unsupervised patient stratification.
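The centered log-ratio (CLR) transform of cell-type proportions is the representation the abstract singles out. A minimal sketch of one common way to compute it from a samples-by-cell-types count matrix (the pseudocount and zero handling are assumptions; scECODA's own implementation may differ):

```python
import numpy as np

def clr_proportions(counts, pseudo=0.5):
    """Centered log-ratio transform of per-sample cell-type counts.

    counts: samples x cell_types matrix of cell counts.
    A pseudocount guards against zero proportions (one common convention).
    """
    x = np.asarray(counts, dtype=float) + pseudo
    p = x / x.sum(axis=1, keepdims=True)                 # close to compositions
    logp = np.log(p)
    # Subtract each sample's mean log proportion (its log geometric mean),
    # so every CLR row sums to zero by construction.
    return logp - logp.mean(axis=1, keepdims=True)

Z = clr_proportions([[120, 30, 50], [10, 200, 40]])
```

The resulting rows live in ordinary Euclidean space, so standard distances, clustering, and PCA apply without the spurious correlations that raw proportions induce.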